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Abstract: This paper studies the estimation of a large covariance matrix. We introduce a 
novel procedure called ChoSelect based on the Cholesky factor of the inverse covariance. This 
method uses a dimension reduction strategy by selecting the pattern of zero of the Cholesky 
factor. Alternatively, ChoSelect can be interpreted as a graph estimation procedure for di- 
rected Gaussian graphical models. Our approach is particularly relevant when the variables 
under study have a natural ordering (e.g. time series) or more generally when the Cholesky 
factor is approximately sparse. ChoSelect achieves non-asymptotic oracle inequalities with re- 
spect to the Kullback-Leibler entropy. Moreover, it satisfies various adaptive properties from 
a minimax point of view. We also introduce and study a two-stage procedure that combines 
ChoSelect with the Lasso. This last method enables the practitioner to choose his own trade-off 
between statistical efficiency and computational complexity. Moreover, it is consistent under 
weaker assumptions than the Lasso. The practical performances of the different procedures 
are assessed on numerical examples. 
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1. Introduction 

The problem of estimating large covariance matrices has recently attracted a lot of attention. On 
the one hand, there is an inflation of high-dimensional data in many scientific areas: gene arrays, 
functional magnetic resonance imaging (fMRI), image classification, and climate studies. On the 
other hand, many data analysis tools require an estimation of the covariance matrix S. This is for 
instance the case for principal component analysis (PC A), for linear discriminant analysis (LDA), 
or for establishing independences or conditional independences between the variables. It is known 
for a long time that the simplest estimator, the sample covariance matrix performs poorly when the 
size of the vector p is larger than the number of observations n (see for instance Johnstone [16]). 

Depending on the objectives of the analysis and on the applications, different approaches are used 
for estimating high-dimensional covariance matrices. Indeed, if one wants to perform PCA or to 
establish independences between the covariates, then it is advised to estimate directly the covariance 
matrix S. In contrast, performing LDA further relies on the inverse of the covariance matrix. In the 
sequel, we call this matrix the precision matrix and note it f2. Sparse precision matrices are also 
of interest because of their connection with graphical models and conditional independence. The 
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pattern of zero in fl indeed corresponds to the graph structure of the distribution (see for instance 
Lauritzen [20] Sect.5.1.3). 

Most of the methods based on direct covariance matrix estimation amount to regularize the 
empirical covariance matrix. Let us mention the work of Ledoit and Wolf [21] who propose to replace 
the sample covariance with its linear combination with the identity matrix. However, these shrinkage 
methods arc known to provide an inconsistent estimation of the eigenvectors [17]. Applying recent 
results on random matrix theory, El Karoui [11] and Bickel and Levina [5] have studied thresholding 
estimators of E. The resulting estimator is sparse and is proved (for instance [5]) to be consistent 
with respect to the operator norm under mild conditions as long as log(p)/n goes to 0. These results 
are particularly of interest for performing PCA since they imply a consistent estimation of the 
eigenvalues and the eigenvectors. Observe that all these methods are invariant under permutation 
of the variables. Yet, in many applications (for instance times series, spectroscopy, climate data), 
there exists a natural ordering in the data. In such a case, one should use other procedures to obtain 
faster rates of convergence. Among other, Furrer and Bentgsson [14] and Bickel and Levina [6] use 
banded or tapering estimators. Again, the consistency of such estimators is proved. Moreover, all 
these methods share an attractive computational cost. We refer to the introduction of [5] for a more 
complete review. 

The estimation procedures of the precision matrix 17 fall into three categories depending whether 
there exists an ordering on the variables and to what extent this ordering is important. If there 
is not such an ordering, d'Aspremont et al. [3] and Yuan and Lin [33] have adopted a penalized 
likelihood approach by applying a l\ penalty to the entries of the precision matrix. It has also been 
discussed by Rothman et al. [27] and Friedman et al. [13] and extended by Lam and Fan et al. [19] or 
Fan et al. [12] to other penalization methods. These estimators are known to converge with respect 
to the Frobenius norm (for instance [27]) when the underlying precision matrix is sparse enough. 

When there is a natural ordering on the covariates, the regularization is introduced via the 
Cholcsky decomposition: 

Q = T*5 _1 T , 

where T is a lower triangular matrix with a unit diagonal and S is a diagonal matrix with positive 
entries. The elements of the i-th row can be interpreted as regression coefficient of i-th component 
given its predecessors. This will be further explained in Section 2.1. For time series or spectroscopy 
data, it is more likely that the relevant covariates for this regression of the i-th component are its 
closest predecessors. In other word, it is expected that the matrix T is approximately banded. With 
this in mind, Wu and Pourahmadi [31] introduce a fc-banded estimator of the matrix T by smoothing 
along the first k subdiagonals and setting the rest to 0. The choice of k is made by applying 
AIC (Akaikc [1]). They prove clement- wise consistency of their estimator but did not provide any 
high-dimensional result with respect to a loss function such as Kullback or Frobenius. Bickel and 
Levina [6] also consider fc-banded estimator of T and are able to prove rates of convergence in the 
matrix operator norm. Moreover, they introduce a cross-validation approach for choosing a suitable 
k, but they do not prove that the selection method achieves adaptiveness. More recently, Levina 
et al. [22] propose a new banding procedure based on a nested Lasso penalty. Unlike the previous 
methods, they allow the number k = ki used for banding to depend on the line i of T. They do not 
state any theoretical result, but they exhibit numerical evidence of its efficiency. In the sequel, we 
call the issue of estimating SI by banding the matrix T the banding problem. 

Between the first approach based on precision matrix regularization and the second one which 
relies on banding the Cholesky factor, there exists a third one which is not permutation invariant, 
but does not assume that the matrix T is approximately banded. It consists in approximating T 
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by a sparse lower triangular matrix (i.e. most of the entries are set to 0). 

When is it interesting to adopt this approach? If we consider a directed graphical model whose 
graph is sparse and compatible with the ordering of the variables, then the Cholesky factor T is 
sparse. Indeed, its pattern of zero is related to the directed acyclic graph (DAG) of the directed 
graphical model associated to this ordering (see Section 2.1 for a definition). More generally, it may 
be worth using this strategy even if one does not know a "good" ordering on the variables. On the 
one hand, most of the procedures based on the estimation of T are computationally faster than 
their counterpart based on the estimation of CI. This is due to the decomposition of the likelihood 
into p independent terms explained in Section 3. On the other hand, there exist examples of sparse 
Cholesky factor T such that the precision matrix CI is not sparse at all. Consider for instance a 
matrix T which is zero except on the diagonal and on the last line. Admittedly, it is not completely 
satisfying to apply a method that depends on the ordering of the variables when we do not know a 
good ordering. There are indeed examples of sparse precision matrices CI such that for a bad ordering, 
the Cholesky factor is not sparse at all (see [27] Sect. 4). Nevertheless, if sparse precision matrices 
and sparse Cholesky factors have different approximation capacities, it remains still unclear which 
one should be favored. 

In the sequel, we call the issue of estimating T in the class of sparse lower triangular matrices 
the complete graph selection problem by analogy to the complete variable estimation problem in 
regression problems. In this setting, Huang et al. [15] propose to add an li penalty on the elements 
of T. More recently, Lam and Fan [19] have extended the method to other types of penalty and 
have proved its consistency in the Frobenius norm if the matrix T is exactly sparse. To finish, let us 
mention that Wagaman and Levina [30] have developed a data-driven method based on the isomap 
algorithm for picking a "good" ordering on the variables. 

In this paper, we consider both the banding problem and the complete graph selection problem. 
We introduce a general Iq penalization method based on maximum likelihood for estimating the 
matrices T and S. We exhibit a non-asymptotic oracle inequality with respect to the Kullback loss 
without any assumption on the target Cl. 

For the adaptive banding issue, our method is shown to achieve the optimal rate of convergence 
and is adaptive to the rate of decay of the entries of T when one moves away from the diagonal. 
Corresponding minimax lower bounds are also provided. We also compute asymptotic rates of 
convergence in the Frobenius norm. Contrary to the l\ penalization methods, we explicitly provide 
the constant for tuning the penalty. Finally, the method is computationally efficient. 

For complete graph selection, we prove that our estimator non-asymptotically achieves the op- 
timal rates of convergence when T is sparse. We also provide the corresponding minimax lower 
bounds. To our knowledge, this minimax lower bounds with respect to the Kullback discrepansy 
are also new. Moreover, our method is flexible and allows to integrate some prior knowledge on the 
graph. However, this procedure is computationally intensive which makes it infeasiblc for p larger 
than 30. This is why we introduce in Section 7 a computationally faster version of the estimator by 
applying a two-stage procedure. This method inherits some of the good properties of the previous 
method and applies for arbitrarily large p. Moreover, it is shown to select consistently the pattern 
of zeros under weaker assumptions than the Lasso. These theoretical results are corroborated by a 
simulation study. 

Since data analysis methods like LDA are based on likelihood we find more relevant to obtain 
rates of convergence with respect to the Kullback-Leibler loss than Frobenius rates of convergence. 
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Moreover, considering Kullback loss allows us to obtain rates of convergence which are free of hid- 
den dependency on parameter such as the largest eigenvalue of E. In this sense, we argue that this 
loss function is more natural for the statistical problem under consideration. 

The paper is organized as follows. Section 2 gathers some preliminaries about the Cholcsky 
decomposition and introduces the main notations. In Section 3, we describe the procedure and 
provide an algorithm for computing the estimator f2. In Section 4, we state the main result of 
the paper, namely a general non-asymptotic oracle type inequality for the risk of f2. In Section 
5, we specify our result to the problem of adaptive banding. Moreover, we prove that our so- 
defined estimator is minimax adaptive to the decay of the off-diagonal coefficients of the matrix T. 
Asymptotic rates of convergence with respect to the Frobenius norm are also provided. In Section 6, 
we investigate the complete graph selection issue. We first derive a non-asymptotic oracle inequality 
and then derive that our procedure is minimax adaptive to the unknown sparsity of the Cholesky 
factor T. As previously, we provide asymptotic rates of convergence with respect to the Frobenius 
loss function. Moreover, we introduce a computationally feasible estimation procedure in Section 
7 and we derive an oracle-type inequality and sufficient condition for consistent selection of the 
graph. In Section 8, the performances of the procedure are assessed on numerical examples for 
both the banding and the complete graph selection problem. We make a few concluding remarks 
in Section 9. Sketch of the proof are in Section 10, while the details are postponed to the technical 
Appendix [29]. 

2. Preliminaries 

2.1. Link with conditional regression and graphical models 

In this subsection, we review basic properties about Cholesky factors and explain their connection 
with directed graphical models. 

We consider the estimation of the vector X = (-Xj)i<j< p of size p which follows a centered normal 
distribution with covariance matrix E. We always assume that E is non-singular. We recall that 
the precision matrix Q uniquely decomposes as f2 = T*ST where T is a lower triangular matrix 
with unit diagonal and S is a diagonal matrix. Let us first emphasize the connection between the 
modified Cholesky factor T and conditional regressions. For any i between 2 and p we note U the 
vector of size i — 1 made of the i — 1-th first elements of the ith-linc of T. By convention t\ is the 
vector of null size. Besides, we note Sj the i-th diagonal element of the matrix S. Let us define the 
vector e = (ei)i<i< p of size p as e := TX. By standard Gaussian properties, the covariance matrix 
of e is S. Since the diagonal of T is one, it follows that for any 1 < i < p 



where Var(ei) = s, and the (ej)i<j<j, are independent. 

Let ~G be a directed acyclic graph who vertex set is {1, . . . ,p}. We assume that the direction of 
the edges is compatible with the natural ordering of {1, ... ,p}. In other words, we assume that any 




(1) 
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edge j — > i in G satisfies j < i. Given a vertex i, the set of its parents is defined by: 

pa-g(i) := {j < i, j -> i} . 

Then, the vector X is said to be a directed Gaussian graphical model with respect to ~d if for 
any 1 < j < i < p such that j ^ pa-^(i), Xi is independent of Xj conditionally to (Xk)ke pa -g(i)- 
This means that only the variables (^fc)feepa-g(i) are relevant for predicting Xi among the variables 
(Xk)k<i- There are several definitions of directed Gaussian graphical model (see Lauritzen [20]), 
which are all equivalent when E is non-singular. 

There exists a correspondence between the graph and the Cholesky factor T of the precision 
matrix ft. If X is a directed graphical model with respect to G^, then T\i,j] = for any j < i such 
that j -¥* i. Conversely, X is a directed graphical model with respect to the graph G^ defined by 
j — > i if and only T[i,j] ^ 0. Hence, it is equivalent to estimate the pattern of zero of T and the 
minimal graph ~G compatible with the ordering. 

These definitions and properties depend on a particular ordering of the variables. It is beyond 
the scope of this paper to discuss the graph estimation when the ordering is not fixed. We refer the 
interested reader to Kalisch and Biihlmann [18]. 

2.2. Notations 

For any set A, \A\ stands for its cardinality. We are given n independent observations of the random 
vector X. We always assume that X follows a centered Gaussian distribution Af(0 p , £). In the sequel, 
we note X the n x p matrix of the observations. Moreover, for any 1 < i < p and any subset A of 
{1, . . . ,p — 1}, Xi and X,i respectively refer to the vector of the n observations of Xi and to the 
n x \A\ matrix of the observations of (Xi)i e A- 

In the sequel, /C (Q; f2') stands for the Kullback divergence between the centered normal distribu- 
tion with covariance f2 -1 and the centered normal distribution with covariance We shall also 
sometimes assess the performance of the procedures using the Frobenius norm and the I2 operator 
norm. This is why we respectively define ||j4|||. := Si j ^[*ii] 2 and ||A|| as the Frobenius norm and 
the I2 operator norm of the matrix A. For any matrix fi, </9 max (^) stands for the largest eigenvalue 
of n. Finally, L, L\, ^2,- • ■ denote universal constants that can vary from line to line. The notation 
L specifies the dependency on some quantities. 

3. Description of the procedure 

In this section, we introduce our procedure for estimating ft given a n-sample of the vector X. For 
any i between 1 and p, mi stands for a subset of {1, . . . , i — 1}. By convention, m\ = 0. In terms 
of directed graphs, m, stands for the set of parents of i. Besides, we call any set m of the form 
m = mi x mi x . . . x m p a model. This model m is one to one with a directed graph whose ordering 
is compatible with the natural ordering of {1, ... ,p}. We shall sometimes call m a graph in order 
to emphasize the connection with graphical models. 

Given a model m, we define T m as the affine space of lower triangular matrices T with unit 
diagonal such for any i between 1 and p, the support (i.e. the non-zero coefficients) of ij is included 
in m,. We note Diag{p) the set of all diagonal matrices with positive entries on the diagonal. The 
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matrices T m and S m are then defined as the maximum likelihood estimators of T and S 

(f m , S m ) = arg min C n (T, S) := Ur PrS^TX^X] + J log \S\ (2) 

V / T'eT m , S'GDiag(p) 2 L J 2 

Here, C„(T,S) stands for the negative log-likelihood. Hence, the estimated precision matrix is 
Q m = T^S^ T m . This matrix fl m is the maximum likelihood estimator of ft among the precision 
matrices which correspond to directed graphical models with respect to the graph m. 

For any i between 1 and p, Aii refers to a collection of subsets of {1, . . . , i — 1} and we call 
Ai := Mi x . . . x Ai p a collection of models (or graphs). The choice of the collection Ai depends on 
the estimation problem under consideration. For instance, we shall use a collection corresponding 
to banded matrices when we will consider the banding problems. The collections Ai are specified 
for the banding problem and the complete graph selection problem in Sections 5 and 6. 

Our objective is to select a model m <E Ai such that the Kullback-Lcibler risk E[/C(fi; TO )] is 
as small as possible. We achieve it through penalization. For any 1 < i < p, pent : Aii — > K + 
is a positive function that we shall explicitly define later. The penalty function pen : Ai —> M + 
is defined as pen(m) = Y^ V i=iP en i{ m i)- Then, we select a model in that minimizes the following 
criterion 

m := arg min 2£„(T m , S m ) + pen(m) = arg min tr 
For short, we write := flfh, S := Sfn, and T = T m . 

As mentioned earlier, the idea underlying the use of the matrices T and S lies in the regression 
models (1). Indeed, these regressions naturally appear when deriving the negative log-likelihood 
(2): 

2£„(T', S>) = J2 IIX, + X <i «.)*lln + log(4) , 

i=l 

where ||.|| n stands for the Euclidean norm in R™ divided by \Jn. By definition of T m and S m , 
we easily derive that the i-th row vector ti. mi of T m and the i-th. diagonal element Si <mi of S m 
respectively equal 

U,mi = arg min ||Xj + X<i(^)* || 2 n and s 2 l nii = ||Xj + X <!: t* m . \\ 2 n , (3) 

supp(^)Cmi 

for any 1 < i < p. Here, supp(f^) stands for the support of t^. Hence, the row vector U imi is the least- 
squares estimator of ti in the regression model (1) and Si j7ni is the empirical conditional variance of 
Xi given X mi . There are two main consequences: first, Expression (3) emphasizes the connection 
between covariance estimation and linear regression in a Gaussian design. Second, it highly simplifies 
the computational cost of our procedure. Indeed, the negative log-likelihood £„(T m , S m ) now writes 

— 1 p 

£„ (f m , S m J = 2 X! t log (^> m «) + ^ • 
i=l 

and it follows that fhi = argmin TOi g J vf j log(si jlni ) + peni(nii). This is why we suggest to compute 
fh and f2 as follows. Assume we are given a collection of graphs Ai = (Aii, . . . ,Ai p ) and penalty 
functions (pen\(.), . . . ,pen p (.)). 
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Algorithm 3.1. Computation of fh andVl. 

1. For i going from 1 to p, 

• Compute Si, mi for each model rrii € Mi- 

• Take TOj = argmin mi(E7Mi log(si, m J +pen i (m i ). 

2. Set fh = (mi, . . . ,fh p ) and built (T , S) by gathering the estimators (^m^Si.mJ- 

3. Take Q = TS^f. 

In what follows, we refer to this method as ChoSelect. In order to select fh, one needs to 
compute all Si TOj for any i <= {1, . . . ,p} and any model m, G M-i- Hence, the complexity of the 
procedure is proportional to X)f=i l-^il- We further discuss computational issues and we provide a 
faster procedure in Section 7. 



4. Risk analysis 



In this section, we first provide a bias- variance decomposition for the Kullback risk of the parametric 
estimator fl m . Afterwards, we state a general non-asymptotic risk bound for Q. 



4-1. Parametric estimation 



Let m be model in M. . Let us define the matrix f2 m as the best approximation of f2 that corresponds 
to the model to. The matrices T m and S m are defined as the minimizers in T m and Diag(p) of the 
Kullback loss with O 

(T m , S m ) := arg min K (SI; T^S'^T') 

T'eT m , S'GDiag(p) V ' 

We note Sl m = T^S^T m . 

We define the conditional Kullback-Leibler divergence of the distribution of Xi given X<i by 

IC(t u s t ;t[,s^ ^EllC^.^X^X^P^X^X^)]} , (4) 

where Pj ijSi (Xj|A<i) stands for the conditional distribution oi Xi given X<i with parameters (ti,Si). 
Applying the chain rule, we obtain that K.(Sl; SI') = Yli=i ^ (tii s i> t'n s 'i)- Consequently, we analyze 
the Kullback risk E[/C(f2; Sl m )] by controlling each conditional risk E [/C(ij, s,; U imi , s"i, mi )] . Let us 
define f, jTOj and Sj imj as the projections of (t,-, a,-) on the space associated to the model to^ with 
respect to the Kullback divergence K,(ti, sf, ., .). In other words, ti. mi and Si jTOi satisfy 



supp(iJ)Cmi 



(A, + X^t'iYY and s liTOs = Var (A i |A <i ) 



Applying the chain rule, we check that tj )OTj corresponds to (i — l)-th first elements of the i-th line 
of T m and Si, mi is the i-th diagonal element of S m . Thanks to the previous property, we derive a 
bias- variance decomposition for the Kullback risk E [/C(i,, sf, s", )OTj , s*i. mi )]. 



imsart-ejs ver. 2010/09/07 file: Chol_ejs.tex date: October 11, 2010 



N. Verzelen/Covariance estimation 



8 



Proposition 4.1. Assume that \rrii\ is smaller than n — 2. The Kullhack risk of (ii, mi , Si, mi ) de- 
composes as follows 



where R n ,d is defined by 



(5) 



d+1 



d(d + 1) 



* (n - d) + log 1 

n 



n-d-2 2(n-d-l)(n-d-2) 2 
ana *(n — d) := E [log (x 2 (n — d)/(n — d))] . Besides, R n ,d is bounded as follows 



d+1 d+1 

< Rn.d < 



2(n-d-2) 

and i? rl . d 



1 

n-d-2 + 2 
d+1 
2(n-d-2) 



d+1 



i 2 



d- 2 



d+1 



An explicit expression of R n ,d is provided in the proof. Applying the chain rule, we then derive 
a bias- variance decomposition for the maximum likelihood estimator fl m . 

Corollary 4.2. Let m = (mi,...,m p ) be a model such that the size |mj| of each submodel is 
smaller than n — 2. Then, the Kullback risk of the maximum likelihood estimator Q TO decomposes 
into 



3 



v 

JC^Q-.flm^ — JC (£1; fl m ) + Rn,\m, 



If the size \mi\ of each submodels is small with respect to n, the variance term is of the order 
Sr=i(l TO ;l + l)/[2(7i — \m,i\ — 2)]. For other loss functions such as the Frobenius norm or the li 
operator norm between Q and O m , there is no such bias- variance decomposition with a variance 
term that does not depend on the target. 



4-2. Main result 



In this subsection, we state a general non-asymptotic oracle inequality for the Kullback-Lciblcr risk 
of the estimator f2. We shall consider two types of penalty function pen(.): the first one only takes 
into account the complexity of the model collection while the second is based on a prior probability 
on the model collection. 

Definition 4.3. For any integer i between 2 and p, the complexity function Hi(.) is defined by 

Hi(d) := ^log|{m e Mi, \m,i\ = d}\ , 

where d is any integer larger or equal to 1. Besides, -ffi(O) is set to for any i between 1 and p. 

These functions are analogous to the complexity measures introduced in [9] Sect. 1.3 or in [28] 
Sect. 3. 2. We shall obtain an oracle inequality for complexity-based penalties under the following 
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assumption. 

Assumption (Mx, v )'- Given K > 1 and n > 0, the collection M. and the number r\ satisfy 
V 2 < i < p ,Vm, 6 



1 + v/S^d^ Dj < V < V(K) , (6) 

where r)(K) is defined as rj(K) := [1 - 2(3/ (K + 2)) 1 / 6 ] 2 V[l - (3/K + 2) 1 / 6 ] 2 /4. The function n(.) 
is positive and increases to one with K. This condition requires that the size of the collection is 
not too large. Assumption (Hk- ))? ) is similar to the assumption made in [28] Sect 3.1 for obtaining 
an oracle inequality in the linear regression with Gaussian design framework. We further discuss 
(Mk.tj) in Sections 5 and 6 when considering the particular problems of ordered and complete 
variable selection. 

Theorem 4.4. Let K > 1 and let rj < i](K). Assume that n is larger than some quantity no(K) 
only depending on K and that the collection M. satisfies (Mk.t))- If the penalty pen(.) is lower 
bounded as follows 

- I / j \ 2 

peni(mi) > K — ( 1 + v2i?i(l TO j|) ) f or an V 1 < « < P an d m i e -Mi > (7) 

n — md V / 



< L K „ inf 



p 1 

/C (fi; fi m ) + pen(m) + - + r„ , (8) 



77 



iften i/ie rzsfc o/ f2 75 upper bounded by 

where t„ is defined by 

r„ = T(Cl,K,ri,n,p) := L K . n n 5/2 [p + IC{fl; I p )]exp[-nL 2 (K,r])} , 
and Li(K,r\) is positive. Here, I p stands for the identity matrix of size p. 

Remark 4.1. This theorem tells us that Q performs almost as well as the best trade-off between 
the bias term /C(f2;f2 m ) and the penalty termpen(m). The term p/n is unavoidable since it is of 
the same order as the variance term for the null model by Corollary The error term t„ is 

considered as negligible since converges exponentially fast to with n. 

Remark 4.2. The result is non- asymptotic and holds for arbitrary large p as longs n is larger than 
the quantity uq(K) (independent ofp). There is no hidden dependency on p except in the complexity 
functions Hi(.) and Assumption (Wk,ti) that we shall discuss for particular cases in Sections 5.1 
and 6.1. Besides, we are not performing any assumption on the true precision matrix except that 
it is invertible. In particular, we do not assume that it is sparse and we give a rate of convergence 
that only depends on a bias variance trade-off. Besides, there is no hidden constant that depends on 
f2 ( except for r n ). 

Remark 4.3. Finally, the penalty introduced in this theorem only depends on the collection M. and 
on a number K > 1 . One chooses the parameter K depending on how conservative one wants the 
procedure to be. We further discuss the practical choice of K in Sections 5 and 6. In any case, the 
main point is that we do not need any additional method to calibrate the penalty. 
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4-3. Penalties based on a prior distribution 

The penalty defined in Theorem 4.4 only depends on the models through their cardinality. However, 
the methodology developed in the proof easily extend to the case where the user has some prior 
knowledge of the relevant models. 

Suppose we are give a prior probability measure km = -kmi X • • • X hm p on the collection M. 
For any non-empty model rrii G Mi, we define l m \ by 

V2<i<p l Vm i eM, l f W := - b g(W"*>> , (9) 

\irii\ 

By convention, we set ii*' to 1. We define in the next proposition penalty functions based on the 

(i) 

quantity l m that allow to get non-asymptotic oracle inequalities. 

Assumption (H^ 1 ^): Given K > 1 and r\ > 0, the collection M, the numbers lm and the number r\ 
satisfy 



V 2 < i<p ,Vmi G M t , 



1 + \ 21 



2 



<V< V(K) , (10) 



n — \nii 

where n{K) is defined as in (H^. r) ). 

Proposition 4.5. Let K > 1 and let n < n(K). Assume that n > no(^) and that Assumption 
(H^ ; ) is fulfilled. If the penalty pen(.) is lower bounded as follows 



111; 



peni{m,i) > K — — p — - [ 1 + \j 2l m \ I for any 1 < i < p and any rrii G Mi , (11) 
n — \mi\ \ v 

then the risk of £1 is upper bounded by 



2 



E 



K.(n;ri)] < L Kv inf \K (0; fi m ) + pen(m) + -1 + r n , (12) 

where Lx,n an d T n the same as in Theorem 4-4- 

The proof is postponed to the technical Appendix [29]. 

Remark 4.4. In this proposition, the penalty (11) as well as the risk bound (12) depend on the 
prior distribution -km - In fact, the bound (12) means that f2 achieves the trade-off between the bias 
and some prior weight, which is of the order — log[7r» (m)]/n . This emphasizes that 17 favours 
models with a high prior probability. Similar risk bounds are obtained in the fixed design regression 
framework in Birge and Massart [8]. 

Remark 4.5. Roughly speaking, Assumption (^°^ y rj ) requires that the prior probabilities TTMi( m i) 
are not exponentially small with respect to n. 

5. Adaptive banding 

In this section, we apply our method ChoSelect to the adaptive banding problem and we investigate 
its theoretical properties. 
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5.1. Oracle inequalities 

Let d be some fixed positive integer which stands for the largest dimension of the models . For 
any 2 < i < p, we consider the ordered collections 

Mi >old := {0, {1}, {1, 2}, . . . , {1 A (i - d), . . . , i - 1}} , 

and Mf ord := {0}. A model m = (0, . . . , {1, . . . , fc 4 }, . . . , {1, . . . , k p }) in the collection M^ rd cor- 
responds to the set of matrices T such that on each line i of T, only the fcj closest entries to the 
diagonal are possibly non-zero. This collection of models is suitable when the matrix T is approxi- 
mately banded. 

For any 1 < i < p and any model m, in A4f ord we fix the penalty 

pemirm) = K 1 " . (13) 
fi - Nil 

We write f^ rd for the estimator O defined with the collection M^ Td and the penalty (13). 

Corollary 5.1. Let K > 1, 77 smaller than rj(K). Assume that d < njzb- If n is larger than some 
quantity uq (K) , then 

K (O; fiOl < L K , n inf E \k (O; ojl + r n (0, J%T, 77, n,p) . (14) 

This bound is a direct application of Theorem 4.4. 

Remark 5.1. The term r„ is defined in Theorem 4-4 an d * s considered as negligible since it con- 
verges to exponentially fast towards 0. Hence, the penalized estimator f2 achieves an oracle in- 
equality without any assumption on the target f2. 

Remark 5.2. This oracle inequality is non- asymptotic and holds for any p and any n larger than 
no(K). Moreover, by choosing a constant K large enough, one can consider a maximal dimension 
of model d up to the order of n, because r)(K) converges to one when K increases. 

Choice of the parameters K and d. Setting K to 2 gives a criterion close to AICc (see for instance 
[24]). Besides, Verzelen [28] (Prop. 3. 2) has justified in a close framework this choice of K is asymp- 
totically optimal. A choice of K = 3 is advised if one wants a more conservative procedure. We 
have stated Corollary 5.1 for models rrii of size smaller than d = j^—n. In practice, taking the size 
n/2 yields rather good results even if it is not completely ensured by the theory. 

Computational cost. The procedure is fast in this setting. Indeed, its complexity is the same as 
p times the complexity of an ordered variable selection in a classical regression framework. From 
numerical comparisons, it seems to be slightly faster than the methods of Bickel and Levina [6] and 
Levina et al. [22] which require cross-validation type strategics. 

5.2. Adaptiveness with respect to ellipsoids 

We now state that the estimator f^ rd is simultaneously minimax over a large class of sets that we 
call ellipsoids. 
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Definition 5.2. Let (o,i)i<i< p —i be a non-increasing sequence of positive numbers such that a\ = 1 
and let R be a positive number. Then, the set £(a,R,p) is made of all the non-singular matrices 
fi = T* S~ l T where S is in Diag(p) and T is a lower triangular matrix with unit diagonal that 
satisfies the following property 

Xi T[M 2 ~ J ' ]2 <R 2 > V2<z<p. (15) 

j'=i aj 

By convention, we set a p = 0. The sequence (aj) measures the rate of decay of each line of T 
when one moves away the diagonal. Observe that in this definition, every line of T decreases the 
same rate. To the price of more technicity, we can also allow different rates of decay for each line of 
T. We shall restrict ourselves to covariance matrices with eigenvalues that lie in a compact when 
considering the ellipsoid £(a,R,p) 

Boph) ■= I tpmm (^) > ~ and (p max (ft) < 7 j . (16) 

Proposition 5.3. For any ellipsoid £(a,R,p), the minimax rates of estimation is lower bounded 
by 

inf sup E \K (ft; ft)] > Lp sup ( R 2 a 2 A — — — ) ■ (17) 

Let us consider the estimator Cl d ord defined in Section 5.1 with d = L^i+~J and the penalty (13). 
We also fix j > 2. If the sequence (oj)i<j< p and R also satisfy R 2 > — and a 2 ^-^^ < -^r^, th 



en 



< ijf.11,/3,7 inf sup 1 

O n<E£(a,R,p)nB op (f) 



(18) 



sup E JC(f2;0£ o 
if n is larger than uq (K) 

Remark 5.3. The minimax rates of convergence over £(a,R,p) in the lower bound (17) is similar 
to the one obtained for classical ellipsoids in the Gaussian fixed design regression setting (see for 
instance [23] Th. 4-9)- We conclude from the second result that our estimator Cl d ord is minimax 
adaptive to the ellipsoids that are not degenerate (i.e. R 2 > 1/n) and whose rates (aj) does not 
converge too slowly towards zero (i.e. a 2 ^^ Ap < (R 2 ^/^)" 1 ). Note that all the sequences (aj) such 

that a 2 < R 2 ji satisfy the last assumption. 

Remark 5.4. However, the estimator Cl d ord is not adaptive to the parameter^ since the constant L 
in (18) depends on 7. This is not really surprising. Indeed, the oracle inequality (H) is expressed in 
terms of the Kullback loss while the ellipsoids are defined in terms of the entries ofT. If we would 
have considered the minimax rates of estimation over sets analogous to £(a,R,p) but defined in 
terms of the decay of the Kullback bias, then we would have obtained minimax adaptiveness without 
any condition on the eigenvalues. 

We are also able to prove asymptotic rates of convergence and asymptotic minimax properties 
with respect to the Frobenius loss function. For any s > 0, we define the ellipsoid £'(s,p, R) as the 
ellipsoid £(a,R,p) with the sequence (aj)i<i< p _i := i~ s . 

imsart-ejs ver. 2010/09/07 file: Chol_ejs.tex date: October 11, 2010 



N. Verzelen/Covariance estimation 



13 



Corollary 5.4. If ki+p n = o(n) and k := 1 Vmaxi<i< p fcj zs smaller than yjn then uniformly 
over the set t/ or d[(^i, • ■ • i k Pn ), +oo] H S op (7), 

n»-^ rd n 2 F = 0p( E£i ? fc ; +p " ) (19) 

If s > 1/2, then uniformly over the set £' (s, R,p n ) H B p(7), the estimator fi£. d satisfies 

\\n-n d OId \\ 2 F = o P 

Moreover, these two rates are optimal from a minimax point of view. 

The estimator achieves the minimax rates of estimation over special cases of ellipsoids. 

However, all these results depend on 7 and are of asymptotic nature. 




(20) 



6. Complete graph selection 

We now turn to the complete Cholesky factor estimation problem. First, we adapt the model 
selection procedure ChoSelect to this setting. Then, we derive an oracle inequality for the Kullback 
loss. Afterwards, we state that the procedure is minimax adaptive to the unknown sparsity both 
with respect to the Kullback entropy and the Frobenius norm. Finally, we discuss the computational 
complexity and we introduce a faster two-stage procedure. 



6.1. Oracle inequalities 

Again, d is a positive integer that stands for the maximal size of the models m*. We consider the 
collections of models Mf co that contain all the subsets of {1, . . . , i — 1} of size smaller or equal to 
d. A model m € -M^ corresponds to a pattern of zero in the Cholesky factors T. As explained in 
Section 2, such a model m is also in correspondence with an ordered graph which is compatible 
with the ordering. Hence, the collection A4^ is in correspondence with the set of ordered graphs 
~d of degree smaller than d which are compatible with the natural ordering of {1, . . .,p}. 
For any 2 < i < p and any model m, in Mf co we fix the penalty 



peniimi) = log 



1 + K 




1 + log 



i - 1 



(21) 



where K > 1. In the sequel, f^ corresponds to the estimator ChoSelect with the collection M^ 
and the penalty (21). 

Corollary 6.1. Let K > 1 and 7/ < rj'(K) (defined in the proof). Assume that 



d<r)- 



pog(p/d) V 0] 



(22) 
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If n is larger than some quantity no(K), then flf satisfies 



E 



K 



(^co) 



E 



1 + log 



i - 1 



P 
n 



(23) 



where the remaining term r' n is of the same order as r« in Theorem 4-4- 



A proof is provided in Section 10.3. We get an oracle inequality up to logarithms factors, but 
we prove in Section 6.2 that these terms log[(i — l)/|mi|] are in fact unavoidable. For the sake of 
clarity, we straightforwardly derive from (23) the less sharp but more readable upper bound 



E 



< Lk.ti inf { K, (fl; fl r 

mEM^ 



p + \m\ logp 



+ r n (n,K,r),n,p) , 



where 



Remark 6.1. As for the previous results, we do not perform any assumption on the target f2 and 
the obtained upper bound is non-asymptotic. By Condition (22), we can consider dimension d up 
to the order n/[\og(p/n) V 1]. If p is much larger than n, the maximal dimension has to be smaller 
than the order n/log(p). This is not really surprising since it is also the case for linear regression 
with Gaussian design as stated in [28] Sect. 3.2. There is no precise results that proves that this 
n/log(p) bound is optimal but we believe that it is unimprovable. If p is of the same order as n, it 
is possible to consider dimensions up to the same order as p. 

Remark 6.2. The same bound (23) holds if we use the penalty 



pen'^mi) = K 



n — \m,i\ 




1 + log 



For a given K, observe that perii(mi) = log(l+pen^(m;)). Hence, these two penalties are equivalent 
when n is large. In Corollary 6.1, we have privileged a logarithmic penalty, because this penalty gives 
slightly better results in practice. 

Choice of K and d. In practice, we set the maximal dimension to n/{2.5[2 + (log(p/n) V 0)]}. 
Concerning the choice of K, we advise to use the value 1.1, if the goal is to minimize risk. When 
the goal is to estimate the underlying graph, one should use a larger value of K like 2.5 in order to 
decrease the proportion of falsely discovered vertices. 



6.2. Adaptiveness to unknown sparsity 

In this section, we state that the estimator f2^ achieves simultaneously the minimax rates of 
estimation for sparsity of the matrix T. In the sequel, lA\[k,p] stands for the set of positive square 
matrices f2 = T*S~ l T of size p such that its Cholesky factor T contains at most k non-zero off- 
diagonal coefficients on each line. The set U\[k,p] contains the precision matrices of the directed 
Gaussian graphical models whose underlying directed acyclic graph satisfies the two following 
properties: 
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• It is compatible with the ordering on the variables. 

• Each node of y has at most k parents. 

We shall also consider the set U 2 [k,p] that contains positive square matrices whose whose Cholcsky 
factor is A:-sparse (i.e. contains at most k non-zero elements). Hence, the set U2[k,p] corresponds to 
the precision matrices of the directed Gaussian graphical models whose underlying directed acyclic 
graph y is compatible with the ordering on the variables and has at most k edges. When fl belongs 
to U2[k,p] with k "small", we say that the underlying Cholcsky factors T are ultra-sparse. 

For deriving the minimax rates of estimation, we shall restrict ourselves to precision matrices 
whose Kullback divergence with the identity is not too large. This is why we define 

B K {r) :={Q s.t. £(fi; I p ) < pr} , 

for any positive number r > 0. 

Proposition 6.2. Let k and p be two positive integers such that k < p. The minimax rates of 
estimation over the sets Ui[k,p] andU2[k,p] are lower bounded as follows 



inf sup En 

n neWi[fc, P ] 

inf sup En 

h n£U 2 [k, P ] 



> Lkp 



1 + log (p/k) 



> L 



p + k log(p) 



, ifn>Lk 2 [l + log(p/k)] , (24) 
ifk<p. (25) 



Consider K > 1, > 1, and r\ < i~j(K). Assume that n > n$(K) and choose a positive integer 
d that satisfies Condition (22). The penalized estimator Q d defined in Corollary 6.1 is minimax 
adaptive over the sets U\[k,p\ PI Bic(n^) for all k smaller than d that also satisfy n > Lk 2 (l + 
log(p/k)). Lt is also minimax adaptive over U2[k,p] PI B/c(n^) for all k less than d: 



sup En 

neWi[fc,p]ne,cO' 3 ) 

sup En 

fteu 2 [k,p]nB K {nfi) 



ic ( n ; nt 



< Lk,p,t) inf sup En 

Q OeW 1 [fe,p]nB K ;(n' 3 ) 

< Lk,0,ti inf sup En 

Q neu 2 [k, P ]nB K (nf 3 ) 



ic(n-h 



Remark 6.3. The minimax rates of estimation over Ui[k,p] is of order kp[l + log (p/k)]/n. We 
do not think that the condition n > Lk 2 [l + log(p/k)] is necessary but we do not know how to 
remove it. The technical condition K. (f2; I p ) < pn@ is not really restrictive. Lt comes from the term 
n 5 / 2 /C(f2; L p ) exp [— nLx.-q] in Theorem 4-4 which goes exponentially fast to with n as long as 
K(Q,,I p )/p is grows polynomially with respect to n. In conclusion, our estimator Q, d co is adaptive to 
the sparsity of its Cholesky factor T. 

Remark 6.4. Let us translate the proposition in terms of directed graphical models. The Kullback 
minimax rate of covariance estimation over graphical models with at most k parents by node is of 
the order pk (1 +log(p/k))/n. Moreover, the Kullback minimax rate of covariance estimation over 
graphical models with at most k vertices is of the order (p+k\ogp)/n. Finally, Q, d co is minimax adap- 
tive for estimating the distribution of a sparse directed Gaussian graphical model whose underlying 
graph is unknown. 

We can also consider the rates of convergence with respect to the Frobenius norm or the operator 
norm in the spirit of the results of Lam and Fan [19]. We recall that ||.||.p and ||.|| respectively refer 
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to the Frobenius norm and the operator norm in the space of matrices. We also recall that the set 
B op (-f) is defined in (16). 

Corollary 6.3. Let K > 1, i] < rj(K), 7 > 2, and let d be the largest integer that satisfies (22). If 
Pnk n [l +log(p n /fc n )] = o(n), then 



\fi-n d co \\% = P [k n 




(26) 



uniformly on Ui[k n ,p n ] DB op [7]. If p n + k n log(p„) =o(n), then 



l^-^coll! 



Oi 



10 -oill = o f 



'Pn + fc„log(p„) 



iPn + k n log (p n ) 



(27) 



uniformly on I42[k n ,p n ] D B op [j]. Moreover, all these Frobenius rates of convergence are optimal 
from a minimax point of view. 

Remark 6.5. The estimator fl d is asymptotically minimax adaptive to the sets U\\k,p\ H i3 op (7) 
and U2[k,p] n£> p(7) with respect to the Frobenius norm. Moreover, these rates are coherent with 
the ones obtained by Lam and Fan in Sect. 4 of [19]. We do not think that the rates of convergence 
with respect to the operator norm are sharp. 

Remark 6.6. These results are of asymptotic nature and require that p n has to be much smaller 
than n. Besides, the upper bounds on the rates highly depend on the largest eigenvalue tp max (tt). 
This is why we have restricted ourselves to precision matrices whose eigenvalues lie in the compact 
[I/7; 7]. Nevertheless, to our knowledge all results in this setting suffer from the same drawbacks. 
See for instance Th.ll of Lam and Fan [19]. 



7. A two-step procedure 

The computational cost of £l d a is proportional to the size of Adf co , which is of the order of p d . 

Hence, it becomes prohibitive when p is larger than 50. In fact, fl d Q minimizes a penalized criterion 
over the collection Ai d Q . Nevertheless, the collections Aif co contain an overwhelming number of 
models that are clearly irrelevant. This is why we shall use a two-stage procedure. First, we compute 
a subcollection of M. d Q - Then, we minimize the penalized criterion over this subcollection. 

Suppose we are given a fast data-driven method that computes a subset Mi of M d co for any i 
in l,...p. 
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Algorithm 7.1. Computation of m' and Ct* 
1. For i going from 1 to p, 

• Compute the subcollection Mi of Mf co . 

• Compute s'i^ mi for each model rrii € Mi. 

• Takem{ := argmin m . g ^. log(s i)TOi ) +pem(mi) 



2. Set = (m{ , . . . , fh0 and build (T^,S^) by gathering the estimators (t i ,s~ 4 f^f)- 

3. Take fit = f f (S f )- 1 f f . 



In what follows, we refer to this method as ChoSelect . For any 2 < i < p and any model m, 
in Mf co , we advise to fix the penalty as in Section 6.1: 



peni(rrii) = log 



1 + K- 



1 



n - \mi\ 

with K > 1. K = 1.1 gives good results in practice. 



I + log 



i - 1 



Remark 7.1. Observe that we use the same data for computing the collections Mi and the estima- 
tor f2* . The estimator Q,* exhibits a small risk as long as the collections Mi contain good models 
as shown by the following proposition: 

Proposition 7.1. Let m be a model in M d co and A m be the event such that m £ M\ x . . . x M p . 

Under the same assumptions as Corollary 6.1, it holds that 

1 + 1 ° g ( \ m t\ ). + n\ 

(28) 

where r n is defined in Theorem 4-4- 

Remark 7.2. Hence, under the event A m * where m* is the oracle model, fl? achieves the optimal 
of convergence. The estimator achieves also a small risk as soon as any "good" model belongs to the 
estimated collection. Here, "good" refers to a small Kullback risk. Observe that it is much easier to 
estimate a collection Mi that contains a "good" model than directly estimating a "good" model. 

In fact, Algorithm 7.1 and Proposition 7.1 are generally applicable to any collection M and 
penalties defined by (7) or (11). 

The computational cost of Algorithm 7.1 is directly related to the cost of the computation of 
Mi and to the size of the collections Mi- The challenge is to design a fast procedure providing a 
fairly small collection Mi, which contains relevant models with large probability. Let us describe 
two examples of such a procedure. 



E 



K n:Q f ) 1 



< L K .Jic(n;n m ) + J2 — 



i=2 
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Algorithm 7.2. Computation of the collection Mi by the Lasso. 

Let D be an integer smaller than 2 5[2+(iog(p/n)vo)] anc ^ ^ ^ ^ e any P os ^ ve integer. 

1. Using the LARS [10] algorithm, compute the regularization path of the Lasso for the 
regression of X; with respect to the covariates X<^ . 

2. Order the variables X^, ■ • ■ , -XV(i-i)A-D) with respect to their appearance in the regular- 
ization path. 

3. Take Mi :=V(X (1) ,...,X {kA{i _ 1)AD) )[jRP(i,D), 

where V{A) contains all the subsets of A and where RP{i,D) is the regularization path 
stopped at D variables. 



Remark 7.3. The size of the random collection Mi increases with the parameter k. Suppose that 
i is larger than D. The size of Mi is generally of the^order 2 k V D. The case k = corresponds 
to choosing the regularization path of the Lasso for Mi- The estimator f2-' then performs as well 
(up to a logp factor) as the best parametric estimator with a model in the regularization path. The 
collection size is fairly small, but the oracle model may not belong to Mi with large probability. 
This is especially the case is the true covariance E is far from the identity since the Lasso estimator 
is possibly inconsistent. In many cases, the true ( or the oracle ) model is a submodel of the model 
selected by the Lasso with a suitable parameter [2]. When choosing k = D, it is therefore likely that 
the true model or a "good" model belongs to Mi- 

The regularization path of the Lasso is not necessarily increasing [10]. If we want that M contains 
all subsets of sparse solutions of the Lasso we need to use a variant of the previous algorithm: 



Algorithm 7.3. Let D be an integer smaller than 2 5[2+(iog(p/n)vo)] an ^° ^ ^ ^ e an y V os iti ve 
integer. 

1. Using the LARS [10] algorithm, compute all the Lasso solutions for the regression o/X^ 
with respect to the covariates X<^. 

2. For any A > ; consider the set of {Xj 1 , Xj 2 . . . Xj 3 } of variables selected by the Lasso. 
If s\> k we define Af = while we take A^ = V(Xj 1 , . . . , Xj B ) is s\ < k. Here, V(A) 
contains all the subsets of A. 

3. Take Mi := U A>0 4 A U AP(*. D), 

where V(A) contains all the subsets of A and where RP(i,D) is the regularization path 
stopped at D variables. 



In the following proposition, we show the ChoSelect-^ outperforms the Lasso under restricted 
eigenvalue conditions. We consider an asymptotic setup where p and n go to infinity with p larger 



ASSUMPTIONS: 

• (H.l) The covariance matrix E satisfies restricted eigenvalue conditions of order q* > 0. 

^ < uJ^au < ^ y A with |^| = ? » and u e R9 * _ 

u*u 

Moreover, we assume that and q* log(p)/n goes to when p and n go to infinity. 
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(H.2) Fix some v < 1. The vector t p (which corresponds to the p-th line of T) is g-sparse 
with some q < V j^^- The set of non-zero component is denoted m*. Let us set some 
K > 24 V (2/(1 - u)) and define 



32 

M 2 (K,c*) = — 



2 112c* /16c* 



\/ [4(X + 12)/c 



3 9c* V 3c 
For any zero-component i p [j] , we have 

t p \J] 2 >M 2 (K,c*) q -^a*. 

• (H.3) Define Mi(c*,c*) = 2 + 16^-. The quantities 9 and q* are such that 

Mi(c„,c*)q + 1< q* . 

Proposition 7.2. Consider the procedure ChoSelect^ with K as in (H.2) and the penalty (21) 
and the algorithm 7.3. Take k > M*q and D = n/\og(p) 2 . Under Assumptions (H.l), (H.2), and 
(H.3) 

P [fn* = m*, p ] 1 . 

The proof of the proposition is postponed to the appendix [29] 

Remark 7.4. In contrast to ChoSelecv , the Lasso procedure does not consistently select the support 
of t p under restricted eigenvalue conditions [35, 34]- Observe that our assumptions (H..1), (H.2j, 
(H.3j and our result are quite similar to the ones obtained by the stability selection method of 
Meinshausen and Biihlmann [25]. 

Remark 7.5. Under similar conditions, one can prover that ChoSelect^ selects consistently the 
support of any vector ti for n < i < p. In order to consistently estimate the whole pattern of zero 
of T , one needs to slightly change the penalty (21) by replacing (i — 1) by (i — 1) V n. 

Remark 7.6. For the sake of simplicity, we have only described two methods for building the 
collection M. One may also use a collection based on the adaptive Lasso or more generally any 
(data-driven) collection M. Moreover, ChoSelect? can be interpreted as a way to tune an estimation 
procedure and to merge different procedures. Suppose wejare given a collection A of estimation 
procedure. For any procedure a € A, we build a collection M. a using the model corresponding to the 
estimator Sl a or using a regularization path associated to a (if possible). If we take the collection M. 
as the reunion of all M. a for a € A, then by Proposition 7.1 the estimator lV nearly selects the best 
model (from the risk point of view) among the ones previously selected by the procedures a <E A. 

8. Simulation Study 

In this section, we investigate the practical performances of the proposed estimators. We concentrate 
on two applications: adaptive banding and complete graph selection. 
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8.1. Adaptive banding 
8.1.1. Simulation scheme 



Simulating the data. We have used a similar scheme to Levina et al. [22]. Simulations were carried 
out for centered Gaussian vectors with two different precision models. The first one has entries of 
the Cholcsky factor exponentially decaying as one moves away from the diagonal. 

Oi: T[i,j] = 0.5^, j<i; s t = 0.01 

The second model allows different sparse structures for the Cholesky factors. 

n 2 : ki ~ U{1, \j/2]); T[i,j] = 0.5, i - h < j < i - 1 

T[i,j]=0, j<i-kn Si =0.01 

Here U (fci, k 2 ) denotes an integer selected uniformly at random from all integers from k x to k 2 . We 
generate from this structure for p = 30. Levina et al. pointed out that this structure can generate 
poorly conditioned covariance matrix for larger p. To avoid this problem, we divide the variables for 
p = 100 and p = 200 into respectively 4 and 8 different blocks and we generate a random structure 
from the random structure from the model described above for each of the blocks. 

For each of the covariance models, we generate a sample of n = 100. We consider three different 
values of p: 30, 100, and 200. 



We apply the following procedures: 

• our procedure ChoSelect as described in Section 5. More precisely, we take the collection 
M^ 2 \ the penalty (13), and K = 3. 

• the nested Lasso method of Levina et al. [22] . It is computed with the J\ penalty, while its 
tuning parameter is selected via 5-fold cross-validation based on the likelihood. Wc have used 
the penalty J\ instead of J2 for computational reasons. 

• the banding procedure of Bickel and Levina [6] . The tuning parameter is chosen according 
to Sect. 5 in [6] with 50 random splits. 

• the regularization method of Ledoit and Wolf [21]. 

For the first covariance model Oi, we also compute the oracle estimator, i.e. the parametric 
estimator which minimizes the Kullback risk among all the estimators fi m with m £ Ai^^ 2 ^ . Wc 
recall that the computation of the oracle estimator require the knowledge of the target Ox- The 
performances of this estimator are presented here as a benchmark. The experiments are repeated 
N = 100 times. In the second scheme, N\ = 10 precision matrices are sampled and N2 = 10 
experiments are made for each sample. 

8.1.2. Results 

In Tables 1 and 2, we provide evaluations of the Kullback loss 

/C(f2;0) := ^ ir^fT 1 ) - log( | SI 1 1 1 1 ) -p 
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the operator distance ||f2 — f2||, and the operator distance between the inverses — E|| for any 

of the fore-mentioned estimators. We have chosen the Kullback loss because of its connection with 
discriminant analysis. The two other loss functions are interestingly connected to the estimation of 
the eigenvalues and the eigenspaces. 

For the second structure, we also consider the pattern of zero estimated by our procedure, the 
nested Lasso and the banding method of Bickel and Levina. More precisely, we estimate the power 
(i.e. the fraction of non-zero terms in T estimated as non-zero) and the FDR (i.e. the ratio of the 
false discoveries over the true discoveries) in Table 3. 



Method 


Ledoit 


Banding Nested Lasso 


ChoSelect 


Oracle 




Kullback discrepancy K(Q; Q) 


P = 


30 


2.00 ±0.05 


0.90 ±0.05 0.87 ±0.02 


1.00 ±0.03 


0.79 ±0.02 


P = 


100 


14.4 ±0.5 


3.6 ±0.4 3.2 ±0.1 


3.7 ±0.1 


2.9 ±0.1 


P = 


200 


33.4 ±0.6 


9.8 ±1.5 6.4 ±0.1 


7.5 ±0.1 


5.9 ±0.1 






Operator distance ||fi — Q| 


x 10~ 2 




P = 


30 


1.86 ± 0.07 


1.28 ±0.06 1.18 ±0.04 


1.36 ±0.06 


1.19 ±0.04 


P = 


100 


1.76 ± 0.09 


1.68 ±0.14 1.52 ±0.06 


1.75 ±0.06 


1.49 ±0.05 


P = 


200 


1.33 ± 0.01 


2.19 ±0.22 1.61 ±0.04 


1.92 ±0.06 


1.61 ±0.05 






Operator distance 1 1 1 


- s ll 




P = 


30 


0.14 ±0.02 


0.15 ±0.02 0.17 ±0.02 


0.15 ±0.02 


0.14 ±0.02 


P = 


100 


1.4 ±0.2 


1.4±0.2 1.7±0.2 


1.5 ±0.2 


1.4 ±0.2 


P = 


200 


5.9 ±0.6 


5.6 ±0.7 6.8 ±0.7 


6.5 ±0.6 


5.9 ±0.6 



Table 1: Estimation and 95% confidence interval of the Kullback risk, the operator distance risk, 
and the operator distance between inverses risk for the first covariance model f2i. 



Method 


Ledoit Banding Nested Lasso 


ChoSelect 




Kullback discrepancy K(Q; f!) 


P = 


30 


112 ±4 3.2 ±0.2 3.2 ±0.2 


1.2 ± 0.1 


P = 


100 


253 ±7 27.4 ±1.6 7.6 ± 0.2 


3.5 ± 0.1 


P = 


200 


565 ± 5 58 ±2 14.6 ± 0.2 


7.2 ± 0.1 




Operator distance ||fi — Q\\ X 10~ 


2 


P = 


30 


9.6 ±0.4 8.2 ±0.4 7.3 ± 0.4 


3.6 ± 0.3 


P = 


100 


8.7 ±0.2 8.2 ±0.2 6.8 ± 0.2 


3.8 ± 0.2 


P = 


200 


10.0 ±0.2 9.5 ±0.3 7.9 ±0.3 


4.4 ±0.2 




Operator distance — S|| X 1C 


-3 


P = 


30 


13.4 ±4.2 12.9 ±4.0 14.1 ±4.4 


12.9 ±4.0 


P = 


100 


1.5 ±0.4 1.4 ±0.4 1.3 ±0.4 


1.4 ±0.4 


P = 


200 


1.8 ±0.2 1.3 ±0.2 1.3 ±0.2 


1.3 ± 0.2 



Table 2: Estimation and 95% confidence interval of the Kullback risk, the operator distance risk, 
and the operator distance between inverses risk for the second covariance model SI2- 



Comments of Tables 1 and 2: In the first scheme fii, the three methods based on Cholesky decom- 
position exhibit a Kullback risk close to the oracle. The ratio of their Kullback risks over the oracle 
risk remains smaller than 1.4. The risk of the nested Lasso and the banding method is about 15% 
smaller than the risk of ChoSelect. We observe the same pattern for the operator distance between 
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precision matrices. In contrast, all these estimators have more or less the same risks for the operator 
distance between the covariance matrices. The estimator of Ledoit and Wolf is a regularized version 
of the empirical covariance matrix. Its performances with respect to the Kullback loss are poor but 
it behaves well with respect to the operator norms. 

In the second scheme, the method of Ledoit and Wolf performs poorly with respect to the 
Kullback loss functions and the first operator norm loss function. ChoSclcct performs two times 
better than the nested Lasso in terms of the Kullback discrepancy and the operator distance between 
precision matrices. The banding method exhibits a far worse Kullback risk. As in the first scheme, 
the three procedures based on Cholcsky decomposition perform similarly in terms of the operator 
distance between covariance matrices. These last risks are high for p = 30 because the covariance 
matrix is poorly conditioned in this case and its eigenvalues are high. 

The banding method only performs well if the Cholesky matrix T is well approximated by a 
banded matrix, which is not the case in the second scheme. The nested Lasso seems to perform 
well when there is an exponential decay of the coefficients as in the first scheme. However, its per- 
formance seem to be far worse when the decay is not exponential. In contrast, ChoSelect seems 
to always perform quite well. This observation corroborates the theory: indeed, we have stated in 
Corollary 5.1 that ChoSelect satisfies an oracle inequality without any assumption on S. Finally, 
there no clear interpretation for the risk with respect to the operator norm between covarianccs. 





Power X 10^ 


FDRxlO^ 


Method 


Banding 


Nested Lasso 


ChoSelect 


Banding 


Nested Lasso 


ChoSclcct 


p = 30 


69.7 ±2.3 


82.4 ±0.3 


99.2 ± 1.1 


23.0 ± 1.0 


17.9 ±0.2 


4.7±0.1 


p = 100 


27.0 ±0.1 


82.5 ±0.01 


99.4 ±0.2 


3.0 ±0.1 


25.7 ±0.2 


5.0 ±0.1 


p = 200 


26.2 ±0.1 


82.9 ±0.1 


99.6 ±0.1 


3.5 ±0.1 


10.0 ±0.2 


4.5 ±0.2 



Table 3: Estimation and 95% confidence interval of the power and FDR for the second precision 
model Q2- 



Estimating the pattern of zero. In the second scheme, we can compare the ability of the procedures 
to estimate well the pattern of non-zero coefficients (Table 3). The banding method docs not work 
well since the Cholesky factor T is not banded. ChoSelect a higher power and a lower FDR than 
the nested Lasso. 

8.2. Complete Graph selection 

8.2.1. Simulation scheme 

Simulating the data. In the first simulation study, we consider Gaussian random vectors whose 
precision matrices based on directed graphical models. 

1. First, we sample a directed graph G in the following way. For any node i in {2, . . . ,p} and 
any node j < i, we put an edge going from j to i with probability (Esp/(i — 1) A 0.5), where 
Esp is a positive parameter previously chosen. Hence, the expected number of parents for a 
given node is EspA(i — l)/2. 
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2. The precision matrix is then defined from ~G . 

il\ : T[i,j] ~ £7nif[-l, 1] if j i in ~& , 

T[i,j] = lifz=j and T[i,j} = else. 
- t/nif [1, 2] 

In the simulations, we set p = 30, 100, 200, Esp= 1, 3, 5, and n = 100. 

In the second simulation scheme, we consider the case where the "good" ordering is partially 
known. More precisely, we first sample a precision matrix according to the first simulation 
scheme. Then, we sample uniformly 10 variables and change uniformly their place in the ordering. 
This results in a new precision matrix ft^. Its Cholcsky factor is generally less sparse than the one 
of ill ■ The purpose of this scheme is to check whether our method is robust to small changes in the 
ordering. For this study, we choose p — 200, Esp= 1, 3, 5, and n = 100. 

We compute the following estimators: 

• the procedure ChoSelect f as described in Section 7. We take the collection Ai^ Q with 

d = 9 5[2+i"°-(nAp)] ' r ^^ ie c °ll ec ti° n M is computed according to Algorithms 7.1 and 7.2 with 
k = 8. Finally, we use the penalty (21) with K = 1.1. 

• the procedure ChoSelect with collection A4l Q , the penalty (21) with K = 1.1. Since this 
method is computationally prohibitive, we only apply it for p = 30. 

• the regularization method of Ledoit and Wolf [21]. 

• the Glasso method [3]. It is computed using the Glasso R-package by Friedman et al. [13], 
while the tuning parameter is chosen via 5-fold cross validation based on the likelihood. 
Following Rothman ct al. [27] and Yuan and Lin [33], we do not penalize the diagonal of f2. 

• the Lasso method of Huang et al. [15]. The regularization parameter is calculated by 5-fold 
cross validation based on the likelihood. 

For each estimator and simulation scheme, we evaluate the Kullback loss K.(fl; f2), the operator 
|| O — , and the operator distance between the inverses 1 1 £72 1 — E|| . We also consider the pattern of 
zero estimated by our procedure ChoSelect and the Lasso of Huang et al. [15]. More precisely, we 
evaluate the power (i.e. the fraction of non-zero terms in T estimated as non-zero) and the FDR (i.e. 
the ratio of the false discoveries over the true discoveries) in the first simulation study. Empirical 
95% confidence intervals of the estimates are also computed. The experiments are repeated N = 100 
times: Ni = 10 precision matrices are sampled and N2 = 10 experiments are made for each precision 
matrix sampled. 

8.2.2. Results 

Comparison of ChoSelect and ChoSelectf. In Table 4, we have set p = 30 in order to compute the 
method ChoSelect and compare it with ChoSelect f . It seems that both methods perform more or 
less similarly. When the sparsity of the Cholcsky factor decreases (Esp=5), ChoSclcct f exhibits a 
slightly smaller Kullback risk. 

These simulations confirm that ChoSelect f exhibits similar performances to ChoSelect with a 
much small computational complexity. In the other simulations, we only compute ChoSelect f . 
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Kullback discrepancy K(Q; Q) 


Method 


ChoSelect' 


ChoSelect 


Esp=l 


0.69 ± 0.04 


0.69 ±0.04 


Esp=3 


1.29 ± 0.04 


1.31 ±0.05 


Esp=5 


1.95 ± 0.06 


1.82 ±0.06 



Table 4: Comparison between ChoSelect and ChoSelect using the first covariance model flf and 
p = 30. 





Method 




Ledoit 


Glasso 


Lasso 


ChoSelect' 




Kullback discrepancy K(Q; Q) 


P = 


100 


Esp= 


1 


7.7 ±0.1 


3.7±0.1 


3.1±0.1 


2.6 ±0.1 




Esp= 


3 


13.9 ±0.2 


9.4 ±0.1 


7.2 ±0.1 


5.9 ±0.1 




Esp= 


5 


16.7 ±0.2 


12.6 ±0.2 


10.9 ± 0.2 


10.1 ±0.2 


P = 


200 


Esp= 


1 


19.4 ±0.2 


9.4 ±0.2 


7.4 ±0.1 


5.9 ±0.1 




Esp= 


3 


41.0 ±0.8 


21.7 ±0.3 


18.1 ± 0.2 


13.6 ±0.2 




Esp= 


5 


54.8 ±2.1 


35.2 ±0.2 


28.8 ± 0.3 


24.7 ±0.4 




Operator distance ||H — Q\\ 


P = 


100 


Esp= 


1 


5.5 ±0.2 


4.6 ±0.2 


3.8 ±0.2 


3.2 ±0.1 




Esp= 


3 


8.6 ±0.2 


9.3 ±0.2 


6.8 ±0.2 


4.6 ±0.1 




Esp= 


5 


11.5 ±0.1 


11.9 ±0.2 


9.5 ±0.1 


7.6 ±0.3 


p = 


200 


Esp= 


1 


6.2 ±0.1 


5.7 ±0.2 


4.6 ±0.1 


3.8 ±0.2 




Esp= 


3 


10.6 ±0.1 


10.7 ±0.2 


8.8 ±0.2 


5.4 ±0.1 




Esp= 


5 


15.0 ±0.3 


15.0 ±0.2 


13.0 ±0.3 


8.1 ±0.2 






Operator distance J f2 1 — 




p = 


100 


Esp= 


1 


1.5 ±0.1 


1.1 ±0.1 


1.1 ±0.1 


0.8 ±0.1 




Esp= 


3 


4.3 ±0.2 


3.9 ±0.2 


5.5 ±0.3 


3.6 ±0.3 




Esp= 


5 


8.4 ±0.5 


9.1 ±0.7 


13.0 ±0.7 


8.4 ±0.5 


p = 


200 


Esp= 


1 


2.4 ±0.1 


1.9 ±0.1 


1.7 ±0.1 


1.2 ±0.1 




Esp= 


3 


8.3 ±0.5 


6.3 ±0.3 


10.7 ±0.6 


6.6 ±0.3 




Esp= 


5 


16.9 ± 1.4 


14.7 ± 1.0 


30.3 ± 2.9 


17.6 ± 1.6 



Table 5: Comparison between the procedures for the first covariance model fij. 
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Estimation of ft. This study corresponds to the situation where a "good" ordering of the variables 
is known. In Table 5, ChoSelect f has a smaller Kullback risk than the Lasso, which is better than 
the Glasso, and Ledoit and Wolf's method. This is especially true when p is large. We also observe 
the same results it terms of the operator distance between the precision matrices. The results for 
the operator distance between covariance matrices are more difficult to interpret. It seems that the 
risk of the Lasso is high, while the Glasso and ChoSelect f perform more or less similarly. Ledoit 
and Wolf's method gives good results when Esp=3, 5. 



Method 


Lasso 


ChoSelect 1 




Power x 10^ FDR x 10^ 


Power x 10^ FDR x 10^ 


Esp=l 


58.0 ±0.6 79.9 ±0.4 


40.6 ±0.6 5.4 ±0.6 


Esp=3 


65.3 ±0.6 72.7 ±0.3 


50.9 ±0.5 9.7 ±0.4 


Esp=5 


67.4 ±0.4 69.2 ±0.2 


52.0 ±0.3 21.1 ±0.7 



Table 6: Estimation and 95% confidence interval of the power and FDR for the first covariance 
model ftl with p = 200. 

Estimation of the graph. In Table 6, we compare the ability of the procedures to estimate the 
underlying directed graph. This is why we only consider the procedures based on Cholesky decom- 
position: the Lasso of Huang et al. and ChoSelect f . The Lasso exhibits a high power but also a high 
FDR (larger than 50%). In contrast, ChoSelect f keeps the FDR reasonably small to the price of 
a small loss in the power. When p increases, the power of the procedures decreases. These results 
corroborate the results of Proposition 7.2. When the number of parents (i.e. ESP) increases, it 
seems that the FDR of the ChoSelect increases. We recall that if one wants a lower FDR in the 
graph estimator, one should choose a larger value for K . In practice, taking K = 2.5 or K = 3 
enforces the FDR to be smaller than 10%. 



Method 


Ledoit 


Glasso Lasso 


ChoSelect* 




Kullback discrepancy K(Q; Q) 


Esp= 


1 


19.2 ±0.2 


8.8 ±0.2 7.5 ±0.1 


6.0 ±0.1 


Esp= 


3 


39.6 ±0.7 


21.8 ±0.2 18.9 ±0.2 


14.7 ±0.2 


Esp= 


5 


56.4 ± 1.4 


35.6 ±0.3 32.0 ±0.4 


28.9 ±0.4 




Operator distance \\Q — Sl|| 


Esp= 


1 


6.4 ±0.2 


5.6 ±0.1 4.8 ±0.2 


3.8 ±0.1 


Esp= 


3 


10.5 ±0.2 


10.7 ±0.2 8.6 ±0.2 


5.9 ±0.2 


Esp= 


5 


15.0 ±0.1 


14.7 ±0.3 13.6 ±0.2 


9.1 ±0.2 




Operator distance 1 — S|| 


Esp= 


1 


2.4 ±0.1 


1.7 ±0.1 1.8 ±0.1 


1.3 ±0.1 


Esp= 


3 


7.6 ±0.4 


6.3 ±0.4 9.3 ±0.5 


6.6 ±0.4 


Esp= 


5 


20.1 ± 1.6 


16.3 ± 1.3 35.1 ±2.5 


21.5 ±1.5 



Table 7: Comparison between the procedures for the second covariance model ft^, with p = 200. 



Effect of the ordering. In Table 7, we study here the performances of the procedures when the 
ordering of the variables is slightly modified. The Glasso method and the rcgularization method of 
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Ledoit and Wolf perform as in the first scheme since these procedures do not depend on a particular 
ordering of the variables. Lasso and ChoSelect f procedures provide slightly worse results than in the 
first scheme, especially when the sparsity decreases. Indeed, the effect of a bad ordering is higher 
when the sparsity is low. Nevertheless, ChoSelect f still performs better than the other procedures 
for the Kullback risk and the operator distance between precision matrices, while the Glasso and 
ChoSclect f still perform similarly the operator distance between covariance matrices. The respective 
performances are different when the ordering is completely unknown (see the Appendix [29]). 

Conclusion. When the ordering is known or partially known, ChoSelect f has a small risk with 
respect to the Kullback discrepancy and the operator distance between precision matrices. Moreover, 
ChoSelect f provides a good estimation of the underlying graph. It is difficult to interpret the results 
for the operator distance between the covariance matrices. If the objective is to minimize the 
operator distance || £ — S|| , it seems that a direct estimation of £ should be prefered to the inversion 
of an estimation of tt. 

9. Discussion 

Adaptive banding problem. ChoSclcct achieves an oracle inequality and is adaptive to the decay in 
the Cholesky factor T. We have also derived corresponding asymptotic results for the Frobenius loss 
function. This procedure is computationally competitive with the other existing methods. Finally, 
we explicitly provide the penalty and there are therefore no calibration problems contrary to most 
procedures in the literature. In a future work, we would like to study the performances of ChoSelect 
with respect to the operator norm and prove corresponding minimax bounds. Bickel and Levina 
have indeed proved risk bounds for their banding procedure [6] . This method is based on maximum 
likelihood estimators as ChoSelect. This is why we believe that ChoSelect may also satisfy fast rates 
of convergence with respect to the operator distance. 

Complete graph estimation problem. We have derived that ChoSelect satisfies an oracle type inequal- 
ity and we have derived the minimax rates of estimation for sparse Cholesky factors T. ChoSelect 
is shown to achieves minimax adaptiveness to the unknown sparsity of Cholesky factor. As in the 
banded case, we provide an explicit penalty. However, this procedure is computationally feasible 
only for small p. In contrast, the method ChoSclcct f introduced in Section 7 shares some advantages 
of the previous method with a much lower computational cost. In Algorithm 7.2, we propose two 
collections based on the Lasso. In practice, there are maybe smarter ways of building the collections 
Mi than using the Lasso. 

10. Proofs 

10.1. Some notations and probabilistic tools 

First, we introduce the prediction contrasts /;(.,.). Consider i be an integer between 2 and p and 
let (t, t') be two row vectors in R l_1 then the contrast li(t,t') is defined by 




(29) 
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Consider a model m, £ Mi- We define the random variable e mi by 

3 emi 



a.s. 



(30) 



By definition of ti, mi in Section 4.1, the variable e mi is independent of e and of X rni . Besides, 
its variance equals h{ti tmi ,ti). If follows from the definition of Sj )TOi that Si j7ni = h(ti tmi ,ti) + Si. 
The vectors e and e m refer to the n samples of e and e m . For any model m and any vector Z of 
size n, H m Z refers to the projection of Z onto the subspace generated by (Xi) i6m whereas IL^Z 
stands for Z ~H m Z . For any subset m of {1, . . . S m denotes the covariancc matrix of the vector 
X* n . Moreover, we define the row vector Z m := X m \/Lra in order to deal with standard Gaussian 
vectors. Similarly to the matrix X m , the n x \m\ matrix Z m stands for the n observations of Z m . 

Lemma 10.1. The conditional Kullback-Leibler divergence K, (ti, s^t^, s,-) decomposes as 



(til SijtiT $i) 2 



logfi + fi-l 



T7ie estimators t; 



and s'i mi are expressed as follows 



Y .+* 

A <> 1, i,m, 



— — X m; (XJ„.X Tn ) X^Xj , 

In = l|n mi ( e i,mj + e i)lln 



I ni, . X.; 



(31) 



(32) 
(33) 



This lemma is a consequence of the definitions of t 
and 4.1. 



2, mi : °z,mi ; 



and /C (ii , ; t\ ; , sj ; ) in Sections 3 



10.2. Proof of Proposition 4-1 



Proof of Proposition J^.l. First, we decompose the Kullback-Leibler divergence into a bias term and 
a variance term using Expression (31). 



f"2/C (tit &i ! ti^ rni , Si^ rni ) J — E 



loE 



"1" li\ti,mi j ^i) 



- 1 



By definition, £i jTOi is the least-squares estimator of over the set of vectors of size i — 1 whose 
support is included in m,i and —X^t* m . is the best predictor of Xi given X mi . Hence, the prediction 
error li(ti imi ,ti) + Sj equals h(t i mi ,ti^ mi ) + Sj :TOj and it follows that 



E [2/C (ti , , ti rrli , Si^ mi ) j — 2/C (^ , , ti mi , s^ mi ) 

1% (ii : mi i ti,mi ) 



E 



log- 



(34) 



Let us compute the expectation of these three last terms. Notice that risi^ mi / Si :Tni = n||II^.Xj||„/sj 
follows the distribution of a x 2 distribution with n — \rrii\ degrees of freedom. 

lmjl+2 



E 




= E 









X 2 (« - 



1 



n - |mt| — 2 



(35) 
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by Lemma 5 in [4]. Similarly, we compute the expectation of the logarithm as follows: 



log' 



= E 



log 



X 2 (n - \rrii 



= *(n - \rrii\) + log 



n - \mi\ 



(36) 



by definition of the function 4 r (.). The last term h(ti mi ,ti mi ) fsi, mi is slightly more difficult to 
handle. Let us first decompose h(ti iTni ^t imi ): 



(€i + £t,m< ) X mi (X m X mi ) (X X 7ni 



by Lemma 10.1 and definition of e 2 . mi . Observe that + ti >mi is independent of X mi . Hence, 
conditionally to X TOi , li{ti_ mi ,ti_ mi ) only depends on 6j + £i, mi through its orthogonal projection 
onto the space generated by (Kj)j emi . Meanwhile, 's^ mi = HII^ (e, + e ijmi )|| 2 is the orthogonal 
projection of (e^ + £i, mi ) along the same subspace. Thus, li(ti tmi ,U !mi ) and si, mi are independent 
conditionally to X mj . Moreover, s"i iTOi is independent of Xi jTOi . Hence, h(ti tmi ,ti mi ) and Sj imj are 
independent. Following the proof of Lemma 2.1 in [28], we observe that E[Zj(ti )TOi ,tj )TOi )] is the 
expectation of the trace of an inverse Wishart Wish~ l {\mi\, n) times Si >mj . We then obtain that 



E 



= E 



Wish 1 (|mi|,n) 



X 2 (n - |mi|)/n 



l)(n 



2) 



(37) 



since E [Wis/i 1 (|mi|,n)] = |rrii|/(n— — 1) by Von Rosen [26]. Gathering identities (35), (36), 
and (37) with (34) yields the first result (5). Let us now compute the function '£(.). 



Lemma 10.2. For any d larger than 3. 

1 



d-2 



< #(d) < and #(d) 



1 

d 



d 2 



The proof is given in the technical Appendix [29]. Since log(l — d/n) is negative, we obtain the 
first upper bound on R n< d- For any positive number x, log(l + x) < x and consequently log(l — x) is 
smaller than —xj (1 — x) for any x such that < x < 1. It then follows that ^(n— d)+log(l — d/n) > 
— (d + l)/(n — d — 2) and R n .d > (d + l)/[2(n — d — 2)]. Analogously, we obtain the expansion of 
i^n.c; when d/n goes to thanks to Lemma 10.2 and the Taylor expansion of the logarithm. □ 



10.3. Proof of the risk upper bounds 

10.3.1. Proof of the main theorem 

Proof of Theorem 4-4- This result is based on a Kullback oracle inequality for all the estimators 
(<i,Si) with 1 < i < p. Consider an integer 1 < i < p. 

Assumption (W K „): Given K > 1 and 77 > 0, the collection M. and the number rj satisfy 



1 + ^^(Ki) 



n — \m,i 



< i] < n(K) 



(38) 
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where we recall that rj(K) is defined in Eq.(12) in [28]. 

Obviously, Assumption (Hjf : ^) is equivalent to the union of the assumptions (H^-^). 

Proposition 10.3. Let K > 1 and rj < r](K). Assume that n > n^(K), that (H^- ) holds, and 
that the penalty function is lower bounded as follows 

\m\ 



peni(m) > K — — p — - ( 1 + y / 2H l (\m\) ) for any m £ At, and some K > 1 . (39) 



Then, the penalized estimator (U,Si) satisfies 

E [K. (ti,Si;ti,Si)] < L K v inf [E [K (t h sf, % m , s; m )l + pera^m)] + r„ [t h s i} K, rj] . 

nii£Mi 

The remaining term T n (ti, Si, K,rf) is defined by 

r n [ti,Si,K,rj\ := — + L'(K,i])n 5/2 [1 + JC (i l ,s l ;0, 1)] cxp [-nL K , v ] , 

where stands here for the null vector of size i — 1 . 

Let us apply this property for any i between 1 and p. Then, we get an upper bound for E[/C(0; 51)] 
by applying the chain rule as in Section 4.1. The risk bound (8) follows. □ 

Proof of Proposition 10.3. The proof of this result is mainly inspired by ideas introduced in the 
proofs of Th.3 in [4] and of Th.3.4 in [28]. The case i = 1 is a consequence of Proposition 4.1 since 
| All | = 1. Let us assume that i is larger than one. For the sake of clarity, we forget the subscripts 
i in the remainder of the proof. 

Let us introduce some new notations. First, (., .)„ is the inner product in M n associated to the 
norm \\.\\ n . Let m be any model in the collection AL 

We shall use the constants K\, k 2 , and v(K ) as defined in the proof of Th.3.4 in [28]. We provide 
their expression for completeness although they are not really of interest. 



,^2 {K-l)[l-^] 2 [l-^-v{K)] 2 
Kl : ~ i 7= TT7\ ' K2 := Tr A 1 



1/6 l 

v{K) := (— — I A 



1/6 



K + 2J 2 
Besides, we introduce the positive constant kq as the largest number that satisfies 

Kn < 1 and < 1 — Kn ■ 

K + 1 3 ~ V ; 2.5 

For clarity, the proof is split into six lemmas. 
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Lemma 10.4. 



2(1 - Ka)K [t,s;t,s] < 2K [t,s;t m ,s m ] + (1 - n )pen{m) + ^4 - [-Ri(m) V (1 - re 2 )(l - «o)] 

+ i?2(«i.) + -R^{fh) + R^(m, fh) , 
s 



where for all model m! £ M., 
Ri(m') := «i + 1 — Ko 



-H l|n m '(e + e 



2 



l{t m >,t) 



l{t m ,,t) + s 



K(l- Ko ) l + y/2H(\m>\) 



in 



X 



<-m>)\\ 2 n 



R 2 (m) 
R 3 {m') 



2 (n m e i n, m e m ) n 



?i — \m'\ l(t m ',t) + s 



1 d(i^t) 2( ~ °^ mm L " (Zm ' Zm,) J l(t m ,,t)- 

\m'\ ||n^,(e + € 



2 
II n 



III /ell 2 

J-J-m' c || n 
S 

2 



-if(l-Ko) l + y/2H(\m'\) 



|m'| + a 



i? 4 (m, m') := (||e||^ — s(l — k )) 



3 m °m 



This lemma gives a decomposition of the relevant terms that we have to bound. See [29] Sect. 1.1 
for a detailed computation. In the next four lemmas, we bound each of these terms. 

Lemma 10.5. Let us assume that n > hq(K), where no(K) is defined in the proof. There exists 
an event M± of probability larger than 1 — Lk n exp [— nL'(K, r])] with L'[K,r\) > such that 

i?i(m)l Bl < v(K,r])(l - K ) , 
where v(K,r)) is a positive constant (strictly) smaller than 1. 

Lemma 10.6. Assume that n > uq(K). Then, under the event Bi defined in the proof of Lemma 
10.5, 



3 



-i? 3 (m)l Bl 



< 



J K,T] 



These two upper bounds are at the heart of the proof. The sketch of their proofs is analogous 
to Lemmas 7.10 and 7.11 in [28]. The main tools are deviation inequalities of x 2 random variables 
and of the largest eigenvalue of a Wishart matrix. See [29] Sect. 1.2 and 1.3 for detailed proofs. 

Since l(t, t) /J is smaller than 2/C [t, s; t, s] , it follows that 
2E [K, (t, s;t,s) IbJ < Lk, v {2E [K (t, s;t m , s" m )] + pen(m) + E [(R 2 (m) + i? 4 (m, m))^]} . 

Lemma 10.7. Assume that n > hq(K). Considering the event Bi defined in Lemma 10.5, we 
bound R2(m) by 



E[i? 2 (m)l Bl ] < 
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See [29] Sect. 1.4 for a detailed proof. 

Lemma 10.8. Assume that n > no(K). Considering the event Bi defined in Lemma 10.5, we 
bound R^(m) by 

E [Ri(m, m)lBi] < Lpen(m) + nexp [—uLk] ■ 

The proofs of this lemma relies on the same ideas as the proofs of Lemma 3 in [4]. See [29] 
Sect. 1.5 for a detailed proof. 

Gathering these two lemmas, we control the Kullback risk of (t, s) on the event Bi 

2E [K. (t, s;t,s) IbJ < Lk, v {2E [JC (t, s;t m ,s m )] + pen(m)} 

+ — + (n + L)cxp[-~nL K ] . (40) 
n 

To conclude, we need to control the Kullback risk of the estimator (t, s) on the event BJ. 

Lemma 10.9. Outside the event Bi . the Kullback risk is upper bounded as follows: 

E [K(t,s;t,s) 1 B «] < L K , v n 5/2 [1 + K(t, s; 0, 1)] exp [-nL K ] ■ 

This lemma is based on Holder's inequality and on an upper bound of the moments of the para- 
metric losses K.(t, s; t m ,'s m ). A detailed proof is in the technical Appendix [29] Sect. 1.6. Combining 
(40) and Lemma 10.9 allows to conclude 

E[K(t,s;t,s)] < L K ^ [E [K (t, s;t m ,s m )] +pen(m)] H — — 
+ L K ^n 5/2 [1 +K(t,s;0, l)]exp [-nL K ] ■ 

□ 

10.3.2. Proof of the corollaries 

Proof of Corollary 5.1. The functions Hi{.) equal for all the collections A^f ord . Hence, the col- 
lections satisfies We conclude by gathering Proposition 4.1 and Theorem 4.4. □ 

Proof of Corollary 6.1. First, we claim that for any K > 1 the penalties (21) are lower bounded by 
penalties defined in (7) with some K' > 1 if 

\ mi \/(n - K|) jl + ^2[l + bg((i-l)/|mi|)] a j < i/(K) . 

If we assume that d[l + log(p/d) V 0] < nn'(K), for some well chosen function T]'(K), then (Hx\, ( ) is 
fulfilled and that the risk bound (23) holds. A detailed proof is in the technical Appendix citetech- 
nical Sect. 1.7. 

□ 

Proof of Proposition 7.1. Under the event A m , the model m belongs to the collection A^i x . . .xM p . 
Hence for any i in 1, . . .p, log(s i -/) + pen(fh{) < log(s"j )TOi ) + pen(m,). The rest of the proof is 
analogous to the proof of Theorem 4.4. □ 
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10-4- Proofs of the minimax bounds 

The minimax bounds are based on Fano's method [32]. Since the Kullback discrepansy is not 
a distance, we cannot directly apply this method. Instead, we use a modified version of Birgc's 
lemma [7] for covariance estimation. In the sequel, we note \\t\\i 2 the Euclidean norm of a vector t. 

Lemma 10.10. Let A be a subset of {1, . . . ,p}. For any positive matrices Q and Q! , we define the 
function d(VL, f2') by 



d(Sl,Q!) :=^log 



fill 



/||2 

h 



(41) 



Let T be a subset of square matrices of size p which satisfies the following assumptions: 

1. For all flgT, tp m ax(Q) < 2 and ip m i n (Q) > 1/2. 

2. There exi sts (si,s 2 ) € [1;2] 2 such thatVn G T, VI < i < p, s. t G {si,s 2 }- 

Setting S = minn,n'eT,n^Q' d(Cl, il'), provided that maxf^Q/gx ^(Pa"; ^n") — K i 1°§ ^ e follow- 
ing lower bound holds 



inf sup En /C I f2; f2 ) > K2<5 • 

The numerical constants n\ and Ki are made explicit in the proof. 

The general setup of the proofs is to pick a maximal subset T of matrices that are well separated 
with respect to d(., .) and such that their Kullback discrepansy is not too large. The existence of 
these subsets is ensured by technical combinatorial arguments. We postpone the complete proofs 
to the technical appendix [29] Sect. 2. 

10.5. Proof of the Frobenius bounds 

We derive the Frobenius rates of convergence from the Kullback bounds. Indeed, we prove in [29] 
that 



\\V^n'VE-I Pn \\ 2 F = A[]C{Q;Q,')}+o[}C{n;Q')} , 



(42) 



when K, (f2; fi') is close to 0. Hence, one may upper bound the Frobenius distance between f2' and 
£1 in terms of Kullback discrepancy using that 



\n'-n\\ 2 F 



tr 



Vti ( VEn'Vr, -L p „)tt( Veci'VT, -i^)Vn 



< ^LxWIIv^fiVs-ipX. 

The complete proof of Corollaries 5.4 and 6.3 are postponed to the technical Appendix [29] Sect. 4. 
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